Using Machine Learning to Analyze the 2nd Round of the 2022 Brazilian Presidential Election

Author: Yuri Henrique Galvao - Student # 3151850


This is the Final Project for the Artificial Intelligence Diploma program of The University of Winnipeg - Professional, Applied and Continuing Education (PACE). The idea of this project is to extract the available data generated by the Electronic Voting Machines (EVM) that were used in the 2nd Round of the 2022 Brazilian Presidential Election, clean it, analyze it, and then use clustering models to find data patterns - especially hidden or non-intuitive patterns - and anomalies.

For these clustering and anomaly-detection tasks, I will use the following three unsupervised clustering algorithms, which will be presented and compared: K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Self-Organizing Maps (SOMs).

This notebook is divided into 6 main sections. At the end of the notebook, you will find also the references for the content that I used during the preparation of this notebook. In order to increase readability, some code is hidden (so, please unhide them if you want).

SECTIONS:¶

  1. Introduction
  2. Data Ingestion
  3. Exploratory Data Analysis
  4. Clustering
    4.1 K-Means
    4.2 DBSCAN
    4.3 SOM
  5. Conclusions
  6. References

1. Introduction ^¶

1.1. Context¶

In October of the current year (2022) Brazil had the fiercest presidential election of the past three decades. During the first round of federal elections, more than 10 candidates ran for President of the Federative Republic of Brazil, but all Brazilians knew that only two candidates would have real chances of going to the second round: Luis Inácio “Lula” da Silva and Jair Messias Bolsonaro.

As everyone expected, the presidential elections went to the second round with Lula and Bolsonaro, and Brazil had a very controversial election in which Lula won with 50.9% of the valid votes against 49.1% of the valid votes that Bolsonaro received.

Those election were controversial because, among with other issues, Brazil uses an 100% electronic voting system, which is based on a digital electronic voting machine (EVM) and in a digital voting processing system which is considered a black-box system.

1.2. Project Idea¶

In an attempt to make the system more transparent, the public organization responsible for the elections, the Superior Electoral Court (Tribunal Superior Eleitoral – TSE), published all the data gathered by the EVM in their own website. More precisely, it is possible to get the logs, the digital vote registry (Registro Digital de Voto – RDV), and the voting machine bulletin (Boletim de Urna – BU), of each machine used in the elections.

In this project I will exctract raw data from the log files and the BUs, transform it, load it into Pandas DataFrames, clean it, analyze it, and then use clustering models to find hidden or non-intuitive patterns and anomalies. Nevertheless, those patterns will be a great finding and should tell a lot about the dataset.

For that, this notebook focuses on three algorithms: K-Means, DBSCAN and SOMs. They are implemented using mainly two Python libraries: Scikit-Learn and MiniSom.

2. Data Ingestion ^¶

2.1. Importing the necessary libraries for the ETL process¶

In [24]:
import pandas as pd
import numpy as np
import glob, py7zr, os

2.2. Extracting the Data¶

2.2.1. Web Scraping¶

To web scrap TSE's website and get the zip files of each EVM, I had do develop a Selenium-based web scraper. This webscraper it able to download all EVMs from every single Federative Unit (also called "scope"). Since Brazil has 26 states, 1 Federal District, and 1 scope for the EVMs placed in other countries, the total of scopes / federative units is 28. Therefore, you can run 28 instances of this web scraper to speed up the process of downloading the zip files.

In [21]:
!git clone https://github.com/ygalvao/2022brascraper.git
Cloning into '2022brascraper'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 31 (delta 14), reused 24 (delta 7), pack-reused 0
Receiving objects: 100% (31/31), 8.20 MiB | 9.32 MiB/s, done.
Resolving deltas: 100% (14/14), done.
In [27]:
!python 2022brascraper/web_scraper_en.py
Please, enter the name of the state for this instance of the web scraper (e.g. Rio de Janeiro): ^C
Traceback (most recent call last):
  File "/home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/2022brascraper/web_scraper_en.py", line 19, in <module>
    scope = input('Please, enter the name of the state for this instance of the web scraper (e.g. Rio de Janeiro): ').strip()
KeyboardInterrupt

2.2.2. Unzipping¶

The Bash script below automates the process of unzipping the .bu and .logjez files. It will extract the files in the "extracted" subfolder.

In [ ]:
!./extract_bu_and_log.sh
Archive:  ./BU_e_RDV/DadosDeUrna_o00407-0001900010003_1673243086911.zip
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.bu  
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.logjez  
Archive:  ./BU_e_RDV/DadosDeUrna_o00407-0001900010003_1673269612030.zip
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.bu  
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.logjez  
Archive:  ./BU_e_RDV/DadosDeUrna_o00407-0001900010004_1673243094855.zip
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010004.bu  
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010004.logjez  
(...)
Archive:  ./BU_e_RDV/DadosDeUrna_o00407-9005000080063_1673201365039.zip
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-9005000080063.bu  
 extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-9005000080063.logjez  

2.3. Transforming the Data¶

The BU files (.bu) are in ASN.1 format, which is not readable by humans nor by Python (at least not natively). Therefore, the Bash script below automates the process of dumping the data from the ASN.1 format files into readable flat files (.txt).

Moreover, the "bu_etl" function transforms the data from a BU text file and stores it into a Pandas DataFrame (then returns it).

In [ ]:
!./dump_bu.sh
2023-01-09 21:52:00,620 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010003.bu com as especificações ['bu.asn1']
2023-01-09 21:52:00,868 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010004.bu com as especificações ['bu.asn1']
2023-01-09 21:52:01,115 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010005.bu com as especificações ['bu.asn1']
(...)
2023-01-09 23:31:29,765 - INFO - Converte ./BU_e_RDV/extracted/o00407-1583000350047.bu com as especificações ['bu.asn1']
In [28]:
def bu_etl(file_path:str)->pd.DataFrame:
    """
    Transforms the data from a BU text file and stores it into a Pandas DataFrame.
    
    Arg.: (str) the path for the BU text file 
    
    Returns: a Pandas DataFrame.
    """
    
    # Read the .txt file
    with open(file_path) as f:
        data = f.read()

    # Split the data into lines
    lines = data.split('\n')

    # Initialize two empty lists to store the rows of the DataFrame
    data_dict = {}
    codes = []
    brancos_e_nulos = 0

    # Iterate through the lines of the file
    for i, line in enumerate(lines):

        try:
            # Split the line by ' = ' and extract the left and right parts
            left, right = line.split(' = ')
        except:
            continue

        # Split the left part by '.', extract the last element (the column name), and delete white spaces before and after it(if there is any)
        column_name = left.split('.')[-1].strip()

        # Get the desired data and stores it in a dictionary
        wanted_variables = (
            'fase',
            'local',
            'municipio',
            'zona',
            'secao',
            'qtdEleitoresCompBiometrico',
            'idEleicao',
            'qtdEleitoresAptos',
            'qtdComparecimento',
            'qtd_votos_13',
            'qtd_votos_22',
            'brancos_nulos',
            'versaoVotacao'
            )

        value = right.strip("'")

        if column_name == 'codigo':
            codes.append(value)

        if column_name == 'quantidadeVotos' and len(codes) == 1 and (i > 34 and i < 40):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()

        if column_name == 'quantidadeVotos' and len(codes) == 1 and (i > 40  and i < 46):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()

        if column_name == 'quantidadeVotos' and len(codes) == 0 and (i >= 46):
            column_name = 'brancos_nulos'
            brancos_e_nulos += int(value)
            value = brancos_e_nulos

        if column_name in wanted_variables:
            data_dict[column_name] = [value]

    # Create the DataFrame from the rows
    df = pd.DataFrame(data_dict)
    
    return df

2.4. Loading the Data¶

Below are the procedures (mainly loops) to, finally, load the data into DataFrames that can be used by us, by Scikit-Learn, and by MiniSom.

In [35]:
# Importing the data from the BUs into a Pandas DataFrame
files_df2 = glob.glob("./BU_e_RDV/extracted/*.txt")
df_bu_list = []

## This loop will 
for i, file_path in enumerate(files_df2):
    bu = bu_etl(file_path)
    df_bu_list.append(bu)
    os.remove(file_path)
    
    if i % 50 == 0:            
        df_bu = pd.concat(df_bu_list, ignore_index=True)
        df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_bu_list = [df_bu]

if len(df_bu_list) > 1:
    df_bu = pd.concat(df_bu_list, ignore_index=True)
    df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
In [11]:
# Importing the data from the log files into a Pandas DataFrame
files_df1 = glob.glob("./BU_e_RDV/extracted/*.logjez") # This is where I stored all the 7Zip logfiles from the "extract_bu_and_log.sh" script
df_logs_list = []

## This loop will exctract the logs (in flat file format, .dat) and correctly import its data into a Pandas DataFrame
## At the end of  each iteraction, it will delete the recently extracted .dat file (which has around 700kB)
## and the original 7Zip file (the .logjez files), in order to save space in disk
for i, file_path in enumerate(files_df1):
    logjez_file = py7zr.SevenZipFile(file_path, mode="r")
    logjez_file.extractall(path="./BU_e_RDV/extracted/")
    logjez_file.close()
    log = pd.read_csv(
        './BU_e_RDV/extracted/logd.dat',
        encoding='ISO 8859-1',
        header=None,
        names=['date_time', 'event_type', 'id_evm', 'system', 'description', 'authenticator'],
        sep=None,
        engine='python'
    )
    df_logs_list.append(log)
    os.remove('./BU_e_RDV/extracted/logd.dat')
    os.remove(file_path)
    
    if i % 50 == 0:
        df_logs = pd.concat(df_logs_list, ignore_index=True)
        df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_logs_list = [df_logs]
        
    if i > 20000:
        break

if len(df_logs_list) > 1:
    df_logs = pd.concat(df_logs_list, ignore_index=True)
    df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
---------------------------------------------------------------------------

KeyboardInterrupt 
In [23]:
# In case of crash of this notebook (or if you already have the CSV files), 
#you can just read the CSV files that were created in the previous two cells.
# For that, just run this cell.
# Nevertheless, running this cell when it is not necessary is safe and harmless.
df_logs = pd.read_csv('df_logs.csv', index_col=0)
df_bu = pd.read_csv('df_bu.csv', index_col=0)
In [37]:
df_logs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2257041 entries, 0 to 2257040
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   date_time    object
 1   event_type   object
 2   id_evm       int64 
 3   system       object
 4   description  object
dtypes: int64(1), object(4)
memory usage: 103.3+ MB
In [38]:
df_bu.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34644 entries, 0 to 34643
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   local                       34644 non-null  int64
 1   municipio                   34644 non-null  int64
 2   zona                        34644 non-null  int64
 3   secao                       34644 non-null  int64
 4   qtdEleitoresCompBiometrico  34644 non-null  int64
 5   qtdEleitoresAptos           34644 non-null  int64
 6   qtdComparecimento           34644 non-null  int64
 7   qtd_votos_13                34644 non-null  int64
 8   qtd_votos_22                34644 non-null  int64
 9   brancos_nulos               34644 non-null  int64
dtypes: int64(10)
memory usage: 2.9 MB

3. Exploratory Data Analysis ^¶

3.1. Overall Analysis of the Logs DataFrame¶

In [25]:
from IPython.display import display as show
In [52]:
show(df_logs.head(10))
show(df_logs.tail(10))
date_time event_type id_evm system description authenticator
0 23/09/2022 09:46:41 INFO 67305985 LOGD Início das operações do logd 5A7DA76DD475845C
1 23/09/2022 09:46:41 INFO 67305985 LOGD Urna ligada em 23/09/2022 às 09:45:21 624DDD7828CC05C0
2 23/09/2022 09:46:41 INFO 67305985 SCUE Iniciando aplicação - Oficial - 1º turno 81D6CC6DB5FDDCCC
3 23/09/2022 09:46:41 INFO 67305985 SCUE Versão da aplicação: 8.26.0.0 - Onça-pintada 958EFE65A6117BF4
4 23/09/2022 09:46:43 INFO 67305985 SCUE Urna operando com rede elétrica 13130D8758EFB5EB
5 23/09/2022 09:46:43 INFO 67305985 SCUE Bateria interna com carga plena EE968B0C96E9D91B
6 23/09/2022 09:46:49 INFO 67305985 SCUE Tamanho da mídia interna: 488.7 MB 1E829998F4272762
7 23/09/2022 09:46:52 INFO 67305985 SCUE Tamanho da memória: 489.4 MB 114FCB81EB5E33C1
8 23/09/2022 09:46:53 INFO 67305985 SCUE Verificação de assinatura de aplicação por eta... E0F945D741906B3C
9 23/09/2022 09:46:53 INFO 67305985 SCUE Verificação de assinatura de aplicação por eta... 1ACBC4446E919E65
date_time event_type id_evm system description authenticator
44639068 30/10/2022 17:02:50 INFO 67305985 VOTA Gerando arquivo de resultado [.wsqbio] + [Térm... 02B9AA39FEFCFDD6
44639069 30/10/2022 17:02:50 INFO 67305985 VOTA Gerando arquivo de resultado [.wsqman] + [Início] EBE25D58F9D6EEA5
44639070 30/10/2022 17:02:50 INFO 67305985 VOTA Gerando arquivo de resultado [.wsqman] + [Térm... DEB08BE94048457D
44639071 30/10/2022 17:02:50 INFO 67305985 VOTA Gerando arquivo de resultado [.wsqmes] + [Início] C25D1B574367931F
44639072 30/10/2022 17:02:51 INFO 67305985 VOTA Gerando arquivo de resultado [.wsqmes] + [Térm... 8BD3F01DA05ADD28
44639073 30/10/2022 17:02:51 INFO 67305985 VOTA Gerando arquivo de resultado [.ver] + [Início] 42C25A3BA8669F70
44639074 30/10/2022 17:02:51 INFO 67305985 VOTA Gerando arquivo de resultado [.ver] + [Término] 1010DDA0D08CAF07
44639075 30/10/2022 17:02:51 INFO 67305985 VOTA Gerando arquivo de resultado [.chvtp] + [Início] F53CEA7E4F73FAD2
44639076 30/10/2022 17:02:51 INFO 67305985 VOTA Gerando arquivo de resultado [.chvtp] + [Término] BE5C958E224DFDC6
44639077 30/10/2022 17:02:51 INFO 67305985 VOTA Gerando arquivo de resultado [.logjez] + [Início] 557504AA3E902073
In [57]:
df_logs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18639078 entries, 0 to 18639077
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   date_time      object
 1   event_type     object
 2   id_evm         int64 
 3   system         object
 4   description    object
 5   authenticator  object
dtypes: int64(1), object(5)
memory usage: 853.2+ MB

We can see that each row in the df_logs DataFrame is one event of one of the systems running inside an EVM.

Due to feasibility purposes, this DataFrame is only a fraction (around 2%) of the data of all the EVM's logs.

3.1.1. Checking duplicated rows¶

In [58]:
df_logs.duplicated().sum()
Out[58]:
0

3.1.2. Checking null values¶

In [61]:
df_logs.isna().sum()
Out[61]:
date_time        0
event_type       0
id_evm           0
system           0
description      0
authenticator    0
dtype: int64

3.1.3. Checking zero values¶

In [59]:
(df_logs == 0).any(axis=1).sum()
Out[59]:
0
In [60]:
# Checking zero values on the columns
(df_logs == 0).any(axis=0)
Out[60]:
date_time        False
event_type       False
id_evm           False
system           False
description      False
authenticator    False
dtype: bool

3.1.4. Checking the possible values for each variable and their frequencies¶

In [62]:
for column in df_logs.columns:
    print(column+''''s different possible values quantity:''', df_logs[column].nunique(), '\n')
date_time's different possible values quantity: 556999 

event_type's different possible values quantity: 4 

id_evm's different possible values quantity: 761 

system's different possible values quantity: 10 

description's different possible values quantity: 133351 

authenticator's different possible values quantity: 18639078 

3.2. Overall Analysis of the BUs DataFrame¶

In [43]:
from IPython.display import display as show
In [66]:
show(df_bu.head(10))
show(df_bu.tail(10))
fase local municipio zona secao qtdEleitoresCompBiometrico idEleicao qtdEleitoresAptos qtdComparecimento qtd_votos_13 qtd_votos_22 brancos_nulos versaoVotacao
0 oficial 1406 18376 51 92 199 545 243 208 112 91 5.0 8.26.0.0 - Onça-pintada
1 oficial 1090 23086 34 110 247 547 295 258 200 43 15.0 8.26.0.0 - Onça-pintada
2 oficial 1112 2372 30 12 190 547 291 205 157 40 8.0 8.26.0.0 - Onça-pintada
3 oficial 1414 8052 35 71 185 545 281 220 191 25 4.0 8.26.0.0 - Onça-pintada
4 oficial 1031 17655 12 58 204 545 274 217 147 61 9.0 8.26.0.0 - Onça-pintada
5 oficial 1325 4014 7 188 238 545 310 268 218 44 6.0 8.26.0.0 - Onça-pintada
6 oficial 1066 477 21 310 166 547 300 185 42 142 1.0 8.26.0.0 - Onça-pintada
7 oficial 1074 9539 49 23 158 545 266 183 116 64 3.0 8.26.0.0 - Onça-pintada
8 oficial 1023 22314 27 18 217 547 284 225 157 42 26.0 8.26.0.0 - Onça-pintada
9 oficial 1708 15695 81 44 276 545 347 281 131 134 16.0 8.26.0.0 - Onça-pintada
fase local municipio zona secao qtdEleitoresCompBiometrico idEleicao qtdEleitoresAptos qtdComparecimento qtd_votos_13 qtd_votos_22 brancos_nulos versaoVotacao
61068 oficial 1058 54070 280 120 138 545 377 301 124 169 8.0 8.26.0.0 - Onça-pintada
61069 oficial 1635 58335 126 367 327 545 393 330 128 193 9.0 8.26.0.0 - Onça-pintada
61070 oficial 1392 39136 33 212 298 547 414 355 250 95 10.0 8.26.0.0 - Onça-pintada
61071 oficial 1120 54070 280 318 215 545 371 262 134 118 10.0 8.26.0.0 - Onça-pintada
61072 oficial 1210 58190 76 57 115 545 246 176 50 117 9.0 8.26.0.0 - Onça-pintada
61073 oficial 1040 58300 255 192 223 545 367 260 128 129 3.0 8.26.0.0 - Onça-pintada
61074 oficial 1171 58092 42 18 191 545 374 290 117 161 12.0 8.26.0.0 - Onça-pintada
61075 oficial 1090 39675 69 111 221 547 301 227 183 32 12.0 8.26.0.0 - Onça-pintada
61076 oficial 1120 54259 311 191 247 545 370 291 119 165 7.0 8.26.0.0 - Onça-pintada
61077 oficial 1066 73334 34 92 201 545 272 202 138 58 6.0 8.26.0.0 - Onça-pintada
In [67]:
df_bu.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61078 entries, 0 to 61077
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   fase                        61078 non-null  object 
 1   local                       61078 non-null  object 
 2   municipio                   61078 non-null  object 
 3   zona                        61078 non-null  object 
 4   secao                       61078 non-null  object 
 5   qtdEleitoresCompBiometrico  60764 non-null  object 
 6   idEleicao                   61078 non-null  object 
 7   qtdEleitoresAptos           61078 non-null  object 
 8   qtdComparecimento           61078 non-null  object 
 9   qtd_votos_13                60793 non-null  object 
 10  qtd_votos_22                59551 non-null  object 
 11  brancos_nulos               59380 non-null  float64
 12  versaoVotacao               61078 non-null  object 
dtypes: float64(1), object(12)
memory usage: 6.1+ MB

We can see that each row in the df_bu DataFrame is one EVM.

Due to feasibility purposes, this DataFrame is only a fraction (around 13%) of the data of all the EVM's logs.

3.1.1. Checking duplicated rows¶

In [40]:
df_bu.duplicated().sum()
Out[40]:
0

3.1.2. Checking null values¶

In [69]:
df_bu.isna().sum()
Out[69]:
fase                             0
local                            0
municipio                        0
zona                             0
secao                            0
qtdEleitoresCompBiometrico     314
idEleicao                        0
qtdEleitoresAptos                0
qtdComparecimento                0
qtd_votos_13                   285
qtd_votos_22                  1527
brancos_nulos                 1698
versaoVotacao                    0
dtype: int64

3.1.3. Checking zero values¶

In [70]:
(df_bu == 0).any(axis=1).sum()
Out[70]:
0
In [71]:
# Checking zero values on the columns
(df_bu == 0).any(axis=0)
Out[71]:
fase                          False
local                         False
municipio                     False
zona                          False
secao                         False
qtdEleitoresCompBiometrico    False
idEleicao                     False
qtdEleitoresAptos             False
qtdComparecimento             False
qtd_votos_13                  False
qtd_votos_22                  False
brancos_nulos                 False
versaoVotacao                 False
dtype: bool

3.1.4. Checking the possible values for each variable and their frequencies¶

In [72]:
for column in df_bu.columns:
    print(column+''''s different possible values quantity:''', df_bu[column].nunique(), '\n')
fase's different possible values quantity: 1 

local's different possible values quantity: 265 

municipio's different possible values quantity: 789 

zona's different possible values quantity: 195 

secao's different possible values quantity: 1054 

qtdEleitoresCompBiometrico's different possible values quantity: 421 

idEleicao's different possible values quantity: 2 

qtdEleitoresAptos's different possible values quantity: 523 

qtdComparecimento's different possible values quantity: 426 

qtd_votos_13's different possible values quantity: 333 

qtd_votos_22's different possible values quantity: 291 

brancos_nulos's different possible values quantity: 50 

versaoVotacao's different possible values quantity: 1 

In [96]:
df_bu.idEleicao.value_counts()
Out[96]:
545    34644
547    26434
Name: idEleicao, dtype: int64

3.3. Data Cleaning - Logs DF¶

3.3.1. Deleting unnecessary features¶

In [73]:
df_logs.drop(columns='authenticator', inplace=True)
In [75]:
df_logs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18639078 entries, 0 to 18639077
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   date_time    object
 1   event_type   object
 2   id_evm       int64 
 3   system       object
 4   description  object
dtypes: int64(1), object(4)
memory usage: 711.0+ MB

3.3.2. Correcting data types¶

In [46]:
df_logs['date_time'] = pd.to_datetime(df_logs.date_time)
In [47]:
for column in ['event_type', 'id_evm', 'system']:
    df_logs[column] = df_logs[column].astype('category')
In [79]:
df_logs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18639078 entries, 0 to 18639077
Data columns (total 5 columns):
 #   Column       Dtype         
---  ------       -----         
 0   date_time    datetime64[ns]
 1   event_type   category      
 2   id_evm       category      
 3   system       category      
 4   description  object        
dtypes: category(3), datetime64[ns](1), object(1)
memory usage: 355.5+ MB

3.3.3. Deleting personal data¶

This step is for ethical reasons: it is just to delete some personal data from the poll workers / clerks that are in the log files, like social insurance numbers (Cadastro de Pessoa Física - CPF, in Brazil).

In [84]:
df_logs[df_logs.description.str.contains('mesário', case=False)]
Out[84]:
date_time event_type id_evm system description
438 2022-02-10 07:14:21 INFO 67305985 VOTA Operador indagado se ocorrerá registro de mesá...
439 2022-02-10 07:16:12 INFO 67305985 VOTA Operador confirmou para registrar mesários
440 2022-02-10 07:16:13 INFO 67305985 VOTA Registrando mesários antes da votação
441 2022-02-10 07:16:29 INFO 67305985 VOTA Pedido de leitura da biometria do mesário 0182...
442 2022-02-10 07:16:37 INFO 67305985 VOTA Realizada a conferência da biometria do mesário
... ... ... ... ... ...
18639044 2022-10-30 17:01:33 INFO 67305985 VOTA Mesário 048058530906 não é eleitor da seção
18639045 2022-10-30 17:01:34 INFO 67305985 VOTA Mesário 048058530906 registrado
18639046 2022-10-30 17:01:38 INFO 67305985 VOTA Operador encerrou ciclo de registro de mesários
18639055 2022-10-30 17:02:00 INFO 67305985 VOTA Mesário indagado sobre qualidade do Boletim de...
18639056 2022-10-30 17:02:37 INFO 67305985 VOTA Mesário indicou qualidade OK para Boletim de Urna

1743820 rows × 5 columns

In [92]:
def delete_cpf(description:str)->str:
    """Deletes only the CPF."""
    
    new = description.split(' ')
    for token in new:
        try:
            int(token)
        except:
            pass
        else:
            new.remove(token)
    
    new = ' '.join(new)
    
    return new
In [90]:
df_logs['description'] = df_logs.description.apply(delete_cpf)
In [45]:
df_logs[df_logs.description.str.contains('mesário', case=False)]
Out[45]:
date_time event_type id_evm system description event_type_enc date_time_enc system_enc description_enc
138 2022-07-10 08:43:20 INFO 67305985 VOTA Mesário indagado se deseja imprimir cópia do BU 3 52665 8 24839
821 2022-10-30 07:01:22 INFO 67305985 VOTA Operador indagado se ocorrerá registro de mesá... 3 127038 8 25281
822 2022-10-30 07:11:26 INFO 67305985 VOTA Operador confirmou para registrar mesários 3 127640 8 25277
823 2022-10-30 07:11:26 INFO 67305985 VOTA Registrando mesários antes da votação 3 127640 8 26971
824 2022-10-30 07:11:55 INFO 67305985 VOTA Pedido de leitura da biometria do mesário 3 127669 8 25287
... ... ... ... ... ... ... ... ... ...
2256962 2022-02-10 16:57:10 INFO 2074298 VOTA Título digitado pelo mesário 3 42355 8 27572
2256985 2022-02-10 16:58:55 INFO 2074298 VOTA Título digitado pelo mesário 3 42460 8 27572
2257001 2022-02-10 17:00:05 INFO 2074298 VOTA Título digitado pelo mesário 3 42530 8 27572
2257018 2022-02-10 17:01:04 INFO 2074298 VOTA Título digitado pelo mesário 3 42589 8 27572
2257033 2022-02-10 17:02:04 INFO 2074298 VOTA Título digitado pelo mesário 3 42649 8 27572

212030 rows × 9 columns

3.4. Data Cleaning - BUs DF¶

3.4.1. Deleting unnecessary features¶

In [97]:
df_bu.drop(columns='fase', inplace=True)
In [101]:
df_bu = df_bu[df_bu.idEleicao=='545'].copy() # The id for the 2nd Round of the 2022 Presidencial Election is 545
In [100]:
df_bu.drop(columns='idEleicao', inplace=True)
In [103]:
df_bu.drop(columns='versaoVotacao', inplace=True)
In [104]:
df_bu.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34644 entries, 0 to 61077
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   local                       34644 non-null  object 
 1   municipio                   34644 non-null  object 
 2   zona                        34644 non-null  object 
 3   secao                       34644 non-null  object 
 4   qtdEleitoresCompBiometrico  34644 non-null  object 
 5   qtdEleitoresAptos           34644 non-null  object 
 6   qtdComparecimento           34644 non-null  object 
 7   qtd_votos_13                34580 non-null  object 
 8   qtd_votos_22                33843 non-null  object 
 9   brancos_nulos               33535 non-null  float64
dtypes: float64(1), object(9)
memory usage: 2.9+ MB

3.4.2. Replacing null values¶

After analyzing the null values in this DF, I came to the conclusion that they mean, actually, zero (0) values.

In [41]:
df_bu.fillna(0, inplace=True)

3.4.3. Correcting data types¶

In [48]:
for column in df_bu.columns.to_list():
    if column[:3] == 'qtd':
        df_bu[column] = df_bu[column].astype('int')
    else:
        df_bu[column] = df_bu[column].astype('category')
        
df_bu['brancos_nulos'] = df_bu['brancos_nulos'].astype('int')
In [45]:
df_bu.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34644 entries, 0 to 34643
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   local                       34644 non-null  category
 1   municipio                   34644 non-null  category
 2   zona                        34644 non-null  category
 3   secao                       34644 non-null  category
 4   qtdEleitoresCompBiometrico  34644 non-null  int64   
 5   qtdEleitoresAptos           34644 non-null  int64   
 6   qtdComparecimento           34644 non-null  int64   
 7   qtd_votos_13                34644 non-null  int64   
 8   qtd_votos_22                34644 non-null  int64   
 9   brancos_nulos               34644 non-null  int64   
dtypes: category(4), int64(6)
memory usage: 2.2 MB

3.5. Descriptive Statistics¶

In [95]:
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
In [38]:
# Defining a function to classify columns into categorical and quantitative variables
def check_variables(df: pd.DataFrame) -> list:
    """
    Separates the categorical variables from the quantitative variables, and store them in their respective list.
    """
    
    cols = df.columns
    date_cols = df.select_dtypes(include='datetime').columns
    quantitative_cols = df.select_dtypes(include='number').columns 
    categorical_cols = list(set(cols) - set(quantitative_cols) - set(date_cols))
    quantitative_cols = set(quantitative_cols) - set(date_cols)
    
    return categorical_cols, list(quantitative_cols), list(date_cols)
In [114]:
# Defining a function to examine categorical variables
def examine_categorical(categ_var : pd.Series, top : int = 10, others : bool = True) -> None:
    '''
    This function gets a Pandas DataSeries (categorical column of the Pandas DataFrame) and: 
    - Gets the top 10 (or other chosen quantity) values
    - Compiles all the other values into "others" (or not, if chosen otherwise)
    - Prints a frequency distribution table
    - Plots a pie chart
    - Plots a bar chart
    '''
    
    vc = categ_var.value_counts()
    vc2 = vc.sort_values(ascending=False)[:top]
    new_row = pd.Series(data = {'others': vc.sort_values(ascending=False)[top:].sum()})
    vc3 = pd.concat([vc2, new_row])
    
    if others == True:
        vc = vc3
        msg = f'''Please, note that, for many reasons, only the top {top} values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.'''
    else:
        vc = vc2
        msg = f'''Please, note that, for many reasons, only the top {top} values were considered to these calculations and visualizations.'''
    
    # Frequency distribution
    print(f'''Frequency distribution table for different values of "{categ_var.name}" variable: \n\n{vc}\n''')
    
    print(msg)
    
    
    # Pie chart
    vc.plot(
    kind='pie',
    ylabel=categ_var.name,
    autopct='%.2f%%',
    figsize=(10,10))
    
    plt.show()
    plt.close()
    
    # Bar chart
    bar = vc.plot(
        kind='bar',
        figsize=(10,8),
        align='center')
    
    bar.bar_label(bar.containers[0])
    
    plt.show()
    plt.close()
    
    print('_' * 120+'\n' * 3)
In [115]:
# Defining a function to examine numerical variables
def examine_quant(
    variable:pd.Series,
    optmize_n_bins:bool=False,
    no_outliers:bool=False,
    n_bins:bool=False
)->None:
    '''
    Gets a Pandas DataSeries and: 
    - Prints measures of central tendancy
    - Prints measures of spread
    - Take the outliers out using the 1.5 IQR criterion, if "no_outliers" == True
    - Try to calculate the optimal number of bins for the histogram, if "optmize_n_bins" == True
    - Plots a histogram
    - Plots box-plot
    '''
    
    var_desc = variable.describe()
    
    IQR = var_desc['75%'] - var_desc['25%']
    
    print(f'''### Measures for variable '{variable.name}':
    
## Measures of center:
Mode: {variable.mode()[0]}
Mean: {var_desc['mean']}
Median: {var_desc['50%']}

## Measures of spread:
Min: {var_desc['min']}
Max: {var_desc['max']}
Range: {var_desc['max'] - var_desc['min']}

1st Quartile (Q25): {var_desc['25%']}
3rd Quartile (Q75): {var_desc['75%']}
IQR: {IQR}

Standard deviation: {var_desc['std']}\n''')
    
    if no_outliers == True:
        variable = variable[(variable <= (var_desc['75%'] + 1.5 * IQR)) & (variable >= (var_desc['25%'] - 1.5 * IQR))]
        
    def freedman_diaconis(variable : np.ndarray) -> int:
        """
        Use Freedman Diaconis rule to compute optimal histogram bin width - it tries to return the optimal number of bins. 
        """

        data = np.asarray(variable.values, dtype=np.float_)
        IQR  = stats.iqr(data, rng=(25, 75), scale=1.0, nan_policy='propagate')
        N    = data.size
        bw   = (2 * IQR) / np.power(N, 1/3)

        datmin, datmax = data.min(), data.max()
        datrng = datmax - datmin
        
        result = int(((datrng / bw) + 1)/5)

        return result
    
    #Histogram
    if optmize_n_bins:
        try:
            n_bins_ = freedman_diaconis(variable)
        except Exception as e:
            print(e)
        else:            
            variable.hist(bins=n_bins_)
            plt.show()
            plt.close()
    elif n_bins:
        variable.hist(bins=n_bins)
        plt.show()
        plt.close()
    else:
        variable.hist()
        plt.show()
        plt.close()
    
    #Boxplot
    plt.boxplot(x=variable, labels=[variable.name])
    plt.ylabel(variable.name)
    plt.show()
    plt.close()
    
    #Separator line
    print('_' * 120+'\n' * 3)

3.5.1. Visualizing the Data - Logs DF¶

In [97]:
mpl.rcParams['font.family'] = ['serif']
In [39]:
# Defining better settings for the visualizations using Seaborn
sns.set(rc={'figure.figsize':(8,6)}, style='whitegrid')
In [52]:
# Classifying variables by their type
cat_cols_logs, quan_cols_logs, date_cols_logs = check_variables(df_logs)
In [56]:
print('Categorical variables:')
show(cat_cols_logs)
print('\nQuantitative variables:')
show(quan_cols_logs)
print('\nFull date/time variables:')
show(date_cols_logs)
Categorical variables:
['event_type', 'id_evm', 'system', 'description']
Quantitative variables:
[]
Full date/time variables:
['date_time']
Visualizing and Examining Categorical Variables¶
In [122]:
# Examining categorical variables
for variable in cat_cols_logs:
    examine_categorical(df_logs[variable], top=8, others=True)
Frequency distribution table for different values of "id_evm" variable: 

67305985    11761980
2070822        12840
2054533        12742
2053621        12712
2072283        12679
2036805        12648
2040850        12629
2086021        12607
others       6788241
dtype: int64

Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Frequency distribution table for different values of "system" variable: 

VOTA      16883946
GAP        1140956
ATUE        347426
SCUE        173027
LOGD         78854
INITJE       13284
VPP           1038
RED            357
others         190
dtype: int64

Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Frequency distribution table for different values of "description" variable: 

Aguardando digitação do título                1157222
Título digitado pelo mesário                  1126201
Tecla indevida pressionada                    1053952
Eleitor foi habilitado                        1035473
O voto do eleitor foi computado               1035029
Voto confirmado para [Presidente]             1034747
Solicita digital. Tentativa [1] de [4]         928686
Capturada a digital. Tentativa [1] de [4]      915983
others                                       10351785
dtype: int64

Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Frequency distribution table for different values of "event_type" variable: 

INFO       18460996
ALERTA       173218
ERRO           4752
EXTERNO         112
others            0
dtype: int64

Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



There are no quantitative variables in df_logs.

In [123]:
# Examining quantitative variables with outliers
for variable in quan_cols_logs:
    try:
        examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
    except Exception as e:
        print(e)
In [124]:
# Examining quantitative variables without outliers
for variable in quan_cols_logs:
    try:
        examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
    except Exception as e:
        print(e)

3.5.2. Visualizing the Data - BUs DF¶

In [57]:
# Classifying variables by their type
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)
Categorical variables:
['municipio', 'zona', 'local', 'secao']
Quantitative variables:
['brancos_nulos',
 'qtdEleitoresAptos',
 'qtd_votos_13',
 'qtdComparecimento',
 'qtdEleitoresCompBiometrico',
 'qtd_votos_22']
Full date/time variables:
[]
Visualizing and Examining Categorical Variables¶
In [127]:
# Examining categorical variables
for variable in cat_cols_bu:
    examine_categorical(df_bu[variable], top=12, others=True)
Frequency distribution table for different values of "municipio" variable: 

58335      1816
12190      1678
54038      1453
17612      1414
58190      1090
6050        952
58041       902
3018        741
59013       694
59250       662
73440       609
17590       540
others    22093
dtype: int64

Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Frequency distribution table for different values of "zona" variable: 

1          1805
2          1548
5           974
7           941
4           773
29          759
98          652
19          607
50          599
8           568
3           540
97          539
others    24339
dtype: int64

Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Frequency distribution table for different values of "secao" variable: 

25          166
27          165
26          162
30          160
28          159
43          159
41          158
33          158
23          156
44          156
22          155
24          155
others    32735
dtype: int64

Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Frequency distribution table for different values of "local" variable: 

1015       1974
1023       1534
1031       1442
1040       1163
1058       1142
1066        988
1082        961
1074        826
1120        750
1112        743
1090        741
1104        660
others    21720
dtype: int64

Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.
No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



Visualizing and Examining Quantitative Variables¶
In [128]:
# Examining quantitative variables with outliers
for variable in quan_cols_bu:
    try:
        examine_quant(df_bu[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
    except Exception as e:
        print(e)
### Measures for variable 'qtd_votos_22':
    
## Measures of center:
Mode: 0
Mean: 109.83163029673248
Median: 110.0

## Measures of spread:
Min: 0.0
Max: 357.0
Range: 357.0

1st Quartile (Q25): 64.0
3rd Quartile (Q75): 155.0
IQR: 91.0

Standard deviation: 57.0083866624702

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtdEleitoresAptos':
    
## Measures of center:
Mode: 393
Mean: 323.0540930608475
Median: 330.0

## Measures of spread:
Min: 22.0
Max: 552.0
Range: 530.0

1st Quartile (Q25): 282.0
3rd Quartile (Q75): 372.0
IQR: 90.0

Standard deviation: 69.95317971623534

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'brancos_nulos':
    
## Measures of center:
Mode: 8
Mean: 8.493418773813648
Median: 8.0

## Measures of spread:
Min: 0.0
Max: 36.0
Range: 36.0

1st Quartile (Q25): 5.0
3rd Quartile (Q75): 11.0
IQR: 6.0

Standard deviation: 4.593713047585358

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtdComparecimento':
    
## Measures of center:
Mode: 268
Mean: 255.61413231728437
Median: 261.0

## Measures of spread:
Min: 13.0
Max: 471.0
Range: 458.0

1st Quartile (Q25): 217.0
3rd Quartile (Q75): 297.0
IQR: 80.0

Standard deviation: 60.862333071411435

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtdEleitoresCompBiometrico':
    
## Measures of center:
Mode: 258
Mean: 221.14672093291767
Median: 225.0

## Measures of spread:
Min: 3.0
Max: 425.0
Range: 422.0

1st Quartile (Q25): 179.0
3rd Quartile (Q75): 265.0
IQR: 86.0

Standard deviation: 64.89242566941138

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtd_votos_13':
    
## Measures of center:
Mode: 111
Mean: 133.9214871261979
Median: 128.0

## Measures of spread:
Min: 0.0
Max: 346.0
Range: 346.0

1st Quartile (Q25): 98.0
3rd Quartile (Q75): 170.0
IQR: 72.0

Standard deviation: 52.067254609204234

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



In [130]:
# Examining quantitative variables without outliers
for variable in quan_cols_bu:
    try:
        examine_quant(df_bu[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
    except Exception as e:
        print(e)
### Measures for variable 'qtd_votos_22':
    
## Measures of center:
Mode: 0
Mean: 109.83163029673248
Median: 110.0

## Measures of spread:
Min: 0.0
Max: 357.0
Range: 357.0

1st Quartile (Q25): 64.0
3rd Quartile (Q75): 155.0
IQR: 91.0

Standard deviation: 57.0083866624702

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtdEleitoresAptos':
    
## Measures of center:
Mode: 393
Mean: 323.0540930608475
Median: 330.0

## Measures of spread:
Min: 22.0
Max: 552.0
Range: 530.0

1st Quartile (Q25): 282.0
3rd Quartile (Q75): 372.0
IQR: 90.0

Standard deviation: 69.95317971623534

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'brancos_nulos':
    
## Measures of center:
Mode: 8
Mean: 8.493418773813648
Median: 8.0

## Measures of spread:
Min: 0.0
Max: 36.0
Range: 36.0

1st Quartile (Q25): 5.0
3rd Quartile (Q75): 11.0
IQR: 6.0

Standard deviation: 4.593713047585358

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtdComparecimento':
    
## Measures of center:
Mode: 268
Mean: 255.61413231728437
Median: 261.0

## Measures of spread:
Min: 13.0
Max: 471.0
Range: 458.0

1st Quartile (Q25): 217.0
3rd Quartile (Q75): 297.0
IQR: 80.0

Standard deviation: 60.862333071411435

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtdEleitoresCompBiometrico':
    
## Measures of center:
Mode: 258
Mean: 221.14672093291767
Median: 225.0

## Measures of spread:
Min: 3.0
Max: 425.0
Range: 422.0

1st Quartile (Q25): 179.0
3rd Quartile (Q75): 265.0
IQR: 86.0

Standard deviation: 64.89242566941138

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



### Measures for variable 'qtd_votos_13':
    
## Measures of center:
Mode: 111
Mean: 133.9214871261979
Median: 128.0

## Measures of spread:
Min: 0.0
Max: 346.0
Range: 346.0

1st Quartile (Q25): 98.0
3rd Quartile (Q75): 170.0
IQR: 72.0

Standard deviation: 52.067254609204234

No description has been provided for this image
No description has been provided for this image
________________________________________________________________________________________________________________________



4. Clustering ^¶

4.1 Preprocessing the Data¶

In [385]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

4.1.1. Logs DF¶

Label Encoding¶
In [31]:
labelencoder = LabelEncoder()
In [32]:
df_logs
Out[32]:
date_time event_type id_evm system description
0 2022-02-10 17:15:10 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [2]
1 2022-02-10 17:15:47 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [3]
2 2022-02-10 17:16:25 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [4]
3 2022-02-10 17:17:02 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [5]
4 2022-02-10 17:17:11 INFO 67305985 VOTA Imprimindo relatório [BUJ] via nº [1]
... ... ... ... ... ...
2257036 2022-02-10 17:02:15 INFO 2074298 VOTA Dedo reconhecido e o score para habilitá-lo. P...
2257037 2022-02-10 17:02:17 INFO 2074298 VOTA Tipo de habilitação do eleitor [biométrica]
2257038 2022-02-10 17:02:17 INFO 2074298 VOTA Eleitor foi habilitado
2257039 2022-02-10 17:02:29 INFO 2074298 VOTA Tecla indevida pressionada
2257040 2022-02-10 17:02:35 INFO 2074298 VOTA Voto confirmado para [Deputado Federal]

2257041 rows × 5 columns

In [58]:
for column in cat_cols_logs:
    df_logs[column+'_enc'] = labelencoder.fit_transform(df_logs[column])
In [50]:
df_logs
Out[50]:
date_time event_type id_evm system description event_type_enc date_time_enc system_enc description_enc
0 2022-02-10 17:15:10 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [2] 3 43435 8 24567
1 2022-02-10 17:15:47 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [3] 3 43472 8 24568
2 2022-02-10 17:16:25 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [4] 3 43510 8 24569
3 2022-02-10 17:17:02 INFO 67305985 VOTA Imprimindo relatório [BU] via nº [5] 3 43547 8 24570
4 2022-02-10 17:17:11 INFO 67305985 VOTA Imprimindo relatório [BUJ] via nº [1] 3 43556 8 24564
... ... ... ... ... ... ... ... ... ...
2257036 2022-02-10 17:02:15 INFO 2074298 VOTA Dedo reconhecido e o score para habilitá-lo. P... 3 42660 8 3331
2257037 2022-02-10 17:02:17 INFO 2074298 VOTA Tipo de habilitação do eleitor [biométrica] 3 42662 8 27563
2257038 2022-02-10 17:02:17 INFO 2074298 VOTA Eleitor foi habilitado 3 42662 8 22381
2257039 2022-02-10 17:02:29 INFO 2074298 VOTA Tecla indevida pressionada 3 42674 8 27533
2257040 2022-02-10 17:02:35 INFO 2074298 VOTA Voto confirmado para [Deputado Federal] 3 42680 8 30011

2257041 rows × 9 columns

In [59]:
df_logs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2257041 entries, 0 to 2257040
Data columns (total 10 columns):
 #   Column           Dtype         
---  ------           -----         
 0   date_time        datetime64[ns]
 1   event_type       category      
 2   id_evm           category      
 3   system           category      
 4   description      object        
 5   event_type_enc   int64         
 6   date_time_enc    int64         
 7   system_enc       int64         
 8   description_enc  int64         
 9   id_evm_enc       int64         
dtypes: category(3), datetime64[ns](1), int64(5), object(1)
memory usage: 144.2+ MB
In [388]:
X_logs = df_logs[['event_type_enc', 'description_enc', 'system_enc', 'id_evm_enc']]
In [389]:
X_logs
Out[389]:
event_type_enc description_enc system_enc id_evm_enc
0 3 24567 8 87
1 3 24568 8 87
2 3 24569 8 87
3 3 24570 8 87
4 3 24564 8 87
... ... ... ... ...
2257036 3 3331 8 48
2257037 3 27563 8 48
2257038 3 22381 8 48
2257039 3 27533 8 48
2257040 3 30011 8 48

2257041 rows × 4 columns

Scaling the data¶
In [390]:
sc = MinMaxScaler(feature_range = (0,1))
X_logs_scaled = sc.fit_transform(X_logs)
In [391]:
X_logs_scaled
Out[391]:
array([[1.        , 0.81835443, 0.88888889, 1.        ],
       [1.        , 0.81838774, 0.88888889, 1.        ],
       [1.        , 0.81842105, 0.88888889, 1.        ],
       ...,
       [1.        , 0.74553631, 0.88888889, 0.55172414],
       [1.        , 0.91715523, 0.88888889, 0.55172414],
       [1.        , 0.9997002 , 0.88888889, 0.55172414]])
In [62]:
X_logs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2257041 entries, 0 to 2257040
Data columns (total 4 columns):
 #   Column           Dtype
---  ------           -----
 0   event_type_enc   int64
 1   description_enc  int64
 2   system_enc       int64
 3   id_evm_enc       int64
dtypes: int64(4)
memory usage: 86.1 MB

4.1.2. BUs DF¶

Translating features names to English¶
In [63]:
df_bu.rename(columns={
    'municipio' : 'municipality',
    'zona' : 'zone',
    'secao' : 'section',
    'qtdEleitoresCompBiometrico' : 'qty_voters_with_biometrics',
    'qtdEleitoresAptos' : 'qty_voters_able_to_vote',
    'qtdComparecimento' : 'qty_attendance',
    'qtd_votos_13' : 'qty_votes_on_13',
    'qtd_votos_22' : 'qty_votes_on_22',
    'brancos_nulos' : 'qty_blank_and_null_votes',
}, inplace=True)
In [64]:
# Classifying variables by their type again
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)
Categorical variables:
['section', 'zone', 'local', 'municipality']
Quantitative variables:
['qty_votes_on_13',
 'qty_attendance',
 'qty_voters_with_biometrics',
 'qty_votes_on_22',
 'qty_voters_able_to_vote',
 'qty_blank_and_null_votes']
Full date/time variables:
[]
Label Encoding¶
In [65]:
labelencoder = LabelEncoder()
In [66]:
for column in cat_cols_bu:
    df_bu[column+'_enc'] = labelencoder.fit_transform(df_bu[column])
In [67]:
df_bu
Out[67]:
local municipality zone section qty_voters_with_biometrics qty_voters_able_to_vote qty_attendance qty_votes_on_13 qty_votes_on_22 qty_blank_and_null_votes section_enc zone_enc local_enc municipality_enc
0 1406 18376 51 92 199 243 208 112 91 5 91 50 39 265
1 1414 8052 35 71 185 281 220 191 25 4 70 34 40 64
2 1031 17655 12 58 204 274 217 147 61 9 57 11 2 229
3 1325 4014 7 188 238 310 268 218 44 6 187 6 31 27
4 1074 9539 49 23 158 266 183 116 64 3 22 48 6 94
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34639 1210 58190 76 57 115 246 176 50 117 9 56 74 20 359
34640 1040 58300 255 192 223 367 260 128 129 3 191 134 3 366
34641 1171 58092 42 18 191 374 290 117 161 12 17 41 16 351
34642 1120 54259 311 191 247 370 291 119 165 7 190 147 11 335
34643 1066 73334 34 92 201 272 202 138 58 6 91 33 5 415

34644 rows × 14 columns

In [68]:
df_bu.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34644 entries, 0 to 34643
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   local                       34644 non-null  category
 1   municipality                34644 non-null  category
 2   zone                        34644 non-null  category
 3   section                     34644 non-null  category
 4   qty_voters_with_biometrics  34644 non-null  int64   
 5   qty_voters_able_to_vote     34644 non-null  int64   
 6   qty_attendance              34644 non-null  int64   
 7   qty_votes_on_13             34644 non-null  int64   
 8   qty_votes_on_22             34644 non-null  int64   
 9   qty_blank_and_null_votes    34644 non-null  int64   
 10  section_enc                 34644 non-null  int64   
 11  zone_enc                    34644 non-null  int64   
 12  local_enc                   34644 non-null  int64   
 13  municipality_enc            34644 non-null  int64   
dtypes: category(4), int64(10)
memory usage: 3.2 MB
In [69]:
X_bu = df_bu[quan_cols_bu + [column+'_enc' for column in cat_cols_bu]]
In [70]:
X_bu
Out[70]:
qty_votes_on_13 qty_attendance qty_voters_with_biometrics qty_votes_on_22 qty_voters_able_to_vote qty_blank_and_null_votes section_enc zone_enc local_enc municipality_enc
0 112 208 199 91 243 5 91 50 39 265
1 191 220 185 25 281 4 70 34 40 64
2 147 217 204 61 274 9 57 11 2 229
3 218 268 238 44 310 6 187 6 31 27
4 116 183 158 64 266 3 22 48 6 94
... ... ... ... ... ... ... ... ... ... ...
34639 50 176 115 117 246 9 56 74 20 359
34640 128 260 223 129 367 3 191 134 3 366
34641 117 290 191 161 374 12 17 41 16 351
34642 119 291 247 165 370 7 190 147 11 335
34643 138 202 201 58 272 6 91 33 5 415

34644 rows × 10 columns

In [71]:
X_bu.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34644 entries, 0 to 34643
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   qty_votes_on_13             34644 non-null  int64
 1   qty_attendance              34644 non-null  int64
 2   qty_voters_with_biometrics  34644 non-null  int64
 3   qty_votes_on_22             34644 non-null  int64
 4   qty_voters_able_to_vote     34644 non-null  int64
 5   qty_blank_and_null_votes    34644 non-null  int64
 6   section_enc                 34644 non-null  int64
 7   zone_enc                    34644 non-null  int64
 8   local_enc                   34644 non-null  int64
 9   municipality_enc            34644 non-null  int64
dtypes: int64(10)
memory usage: 2.9 MB
In [72]:
X_bu_quant = df_bu[quan_cols_bu]
In [73]:
X_bu_quant
Out[73]:
qty_votes_on_13 qty_attendance qty_voters_with_biometrics qty_votes_on_22 qty_voters_able_to_vote qty_blank_and_null_votes
0 112 208 199 91 243 5
1 191 220 185 25 281 4
2 147 217 204 61 274 9
3 218 268 238 44 310 6
4 116 183 158 64 266 3
... ... ... ... ... ... ...
34639 50 176 115 117 246 9
34640 128 260 223 129 367 3
34641 117 290 191 161 374 12
34642 119 291 247 165 370 7
34643 138 202 201 58 272 6

34644 rows × 6 columns

Feature Engineering¶
In [74]:
# Let's create a new feature: attendance_rate
df_bu['attendance_rate'] = df_bu.qty_attendance / df_bu.qty_voters_able_to_vote
In [75]:
# Let's scale it, so it become from 0 to 100, instead of 0 to 1
df_bu['attendance_rate'] = df_bu['attendance_rate'] * 100
In [76]:
X_bu_quant['attendance_rate'] = df_bu['attendance_rate']
In [77]:
X_bu_quant
Out[77]:
qty_votes_on_13 qty_attendance qty_voters_with_biometrics qty_votes_on_22 qty_voters_able_to_vote qty_blank_and_null_votes attendance_rate
0 112 208 199 91 243 5 85.596708
1 191 220 185 25 281 4 78.291815
2 147 217 204 61 274 9 79.197080
3 218 268 238 44 310 6 86.451613
4 116 183 158 64 266 3 68.796992
... ... ... ... ... ... ... ...
34639 50 176 115 117 246 9 71.544715
34640 128 260 223 129 367 3 70.844687
34641 117 290 191 161 374 12 77.540107
34642 119 291 247 165 370 7 78.648649
34643 138 202 201 58 272 6 74.264706

34644 rows × 7 columns

In [78]:
X_bu_quant.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34644 entries, 0 to 34643
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   qty_votes_on_13             34644 non-null  int64  
 1   qty_attendance              34644 non-null  int64  
 2   qty_voters_with_biometrics  34644 non-null  int64  
 3   qty_votes_on_22             34644 non-null  int64  
 4   qty_voters_able_to_vote     34644 non-null  int64  
 5   qty_blank_and_null_votes    34644 non-null  int64  
 6   attendance_rate             34644 non-null  float64
dtypes: float64(1), int64(6)
memory usage: 2.1 MB

4.2. K-Means¶

In [79]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

4.2.1. Logs DF¶

In [392]:
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_logs_scaled)
visualizer.show()
plt.show()
No description has been provided for this image
Choosing optimal K¶

In order to find an appropriate number of clusters, the elbow method was used. In this method for this case, the inertia for a number of clusters between 2 and 10 will be calculated. The rule is to choose the number of clusters where you see a kink or "an elbow" in the graph.

The graph above shows the reduction of a distortion score as the number of clusters increases. However, there is no clear "elbow" visible. The underlying algorithm suggests 4 clusters. A choice of 4 or 5 clusters seems to be fair.

Building the model and getting the clusters¶
In [395]:
KM_5_clusters_logs = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_logs_scaled) # initialise and fit K-Means model

KM_5_clustered_logs = X_logs.copy()
KM_5_clustered_logs.loc[:,'Cluster'] = KM_5_clusters_logs.labels_ # append labels to points
In [396]:
KM_5_clustered_logs
Out[396]:
event_type_enc description_enc system_enc id_evm_enc Cluster
0 3 24567 8 87 0
1 3 24568 8 87 0
2 3 24569 8 87 0
3 3 24570 8 87 0
4 3 24564 8 87 0
... ... ... ... ... ...
2257036 3 3331 8 48 4
2257037 3 27563 8 48 2
2257038 3 22381 8 48 2
2257039 3 27533 8 48 2
2257040 3 30011 8 48 2

2257041 rows × 5 columns

Visualizing the clusters¶
In [398]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))


scat_1 = sns.scatterplot(
    KM_5_clustered_logs,
    x='event_type_enc',
    y='system_enc',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM_5_clustered_logs,
    x='id_evm_enc',
    y='event_type_enc',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

axes[0].scatter(KM_5_clusters_logs.cluster_centers_[:,1],KM_5_clusters_logs.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters_logs.cluster_centers_[:,0],KM_5_clusters_logs.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
In [399]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))


scat_1 = sns.scatterplot(
    KM_5_clustered_logs,
    x='description_enc',
    y='id_evm_enc',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM_5_clustered_logs,
    x='description_enc',
    y='event_type_enc',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

axes[0].scatter(KM_5_clusters_logs.cluster_centers_[:,1],KM_5_clusters_logs.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters_logs.cluster_centers_[:,0],KM_5_clusters_logs.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
Checking the size of the clusters¶
In [124]:
KM4_clust_sizes = KM4_clustered.groupby('Cluster').size().to_frame()
KM4_clust_sizes.columns = ["KM_size"]
KM4_clust_sizes
Out[124]:
KM_size
Cluster
0 540974
1 1127412
2 61606
3 527049

4.2.2. BUs DF - with labeled categorical features¶

In [125]:
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_bu)
visualizer.show()
plt.show()
No description has been provided for this image
Choosing optimal K¶

In order to find an appropriate number of clusters, the elbow method was used. In this method for this case, the inertia for a number of clusters between 2 and 10 will be calculated. The rule is to choose the number of clusters where you see a kink or "an elbow" in the graph.

The graph above shows the reduction of a distortion score as the number of clusters increases. However, there is no clear "elbow" visible. The underlying algorithm suggests 4 clusters. A choice of 4 or 5 clusters seems to be fair.

Building the model and getting the clusters¶
In [126]:
KM_5_clusters = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_bu) # initialise and fit K-Means model

KM5_clustered = X_bu.copy()
KM5_clustered.loc[:,'Cluster'] = KM_5_clusters.labels_ # append labels to points
In [127]:
KM5_clustered
Out[127]:
qty_votes_on_13 qty_attendance qty_voters_with_biometrics qty_votes_on_22 qty_voters_able_to_vote qty_blank_and_null_votes section_enc zone_enc local_enc municipality_enc Cluster
0 112 208 199 91 243 5 91 50 39 265 1
1 191 220 185 25 281 4 70 34 40 64 4
2 147 217 204 61 274 9 57 11 2 229 4
3 218 268 238 44 310 6 187 6 31 27 0
4 116 183 158 64 266 3 22 48 6 94 4
... ... ... ... ... ... ... ... ... ... ... ...
34639 50 176 115 117 246 9 56 74 20 359 1
34640 128 260 223 129 367 3 191 134 3 366 1
34641 117 290 191 161 374 12 17 41 16 351 1
34642 119 291 247 165 370 7 190 147 11 335 1
34643 138 202 201 58 272 6 91 33 5 415 1

34644 rows × 11 columns

Visualizing the clusters¶
In [128]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered,
    x='municipality_enc',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered,
    x='municipality_enc',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

axes[0].scatter(KM_5_clusters.cluster_centers_[:,1],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters.cluster_centers_[:,0],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [129]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

axes[0].scatter(KM_5_clusters.cluster_centers_[:,1],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters.cluster_centers_[:,0],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [130]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered,
    x='qty_voters_with_biometrics',
    y='qty_blank_and_null_votes',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered,
    x='qty_blank_and_null_votes',
    y='qty_votes_on_22',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

axes[0].scatter(KM_5_clusters.cluster_centers_[:,1],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters.cluster_centers_[:,0],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
Checking the size of the clusters¶
In [131]:
KM5_clust_sizes = KM5_clustered.groupby('Cluster').size().to_frame()
KM5_clust_sizes.columns = ["KM_size"]
KM5_clust_sizes
Out[131]:
KM_size
Cluster
0 8840
1 11049
2 3456
3 4427
4 6872

4.2.2. BUs DF - without categorical features¶

In [132]:
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X_bu_quant)
visualizer.show()
plt.show()
No description has been provided for this image
Choosing optimal K¶

In order to find an appropriate number of clusters, the elbow method was used. In this method for this case, the inertia for a number of clusters between 2 and 10 will be calculated. The rule is to choose the number of clusters where you see a kink or "an elbow" in the graph.

The graph above shows the reduction of a distortion score as the number of clusters increases. However, there is no clear "elbow" visible. The underlying algorithm suggests 4 clusters. A choice of 4 or 5 clusters seems to be fair.

Building the model and getting the clusters¶
In [133]:
KM_5_clusters_only_quant = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_bu_quant) # initialise and fit K-Means model

KM5_clustered_only_quant = X_bu_quant.copy()
KM5_clustered_only_quant.loc[:,'Cluster'] = KM_5_clusters_only_quant.labels_ # append labels to points
In [134]:
KM5_clustered_only_quant
Out[134]:
qty_votes_on_13 qty_attendance qty_voters_with_biometrics qty_votes_on_22 qty_voters_able_to_vote qty_blank_and_null_votes attendance_rate Cluster
0 112 208 199 91 243 5 85.596708 2
1 191 220 185 25 281 4 78.291815 4
2 147 217 204 61 274 9 79.197080 4
3 218 268 238 44 310 6 86.451613 4
4 116 183 158 64 266 3 68.796992 1
... ... ... ... ... ... ... ... ...
34639 50 176 115 117 246 9 71.544715 1
34640 128 260 223 129 367 3 70.844687 3
34641 117 290 191 161 374 12 77.540107 3
34642 119 291 247 165 370 7 78.648649 3
34643 138 202 201 58 272 6 74.264706 4

34644 rows × 8 columns

Visualizing the clusters¶
In [135]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_votes_on_13',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_blank_and_null_votes',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [136]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_22',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_votes_on_13',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [137]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

scat_1 = sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_blank_and_null_votes',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='qty_attendance',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [138]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_able_to_vote',
    y='qty_blank_and_null_votes',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_able_to_vote',
    y='qty_attendance',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [139]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

sns.scatterplot(
    KM5_clustered_only_quant,
    x='attendance_rate',
    y='qty_votes_on_13',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='attendance_rate',
    y='qty_votes_on_22',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
In [140]:
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_with_biometrics',
    y='attendance_rate',
    hue='Cluster',
    ax=axes[0],
    palette='Set1',
    legend='full'
)

sns.scatterplot(
    KM5_clustered_only_quant,
    x='qty_voters_able_to_vote',
    y='attendance_rate',
    hue='Cluster',
    palette='Set1',
    ax=axes[1],
    legend='full')

#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
No description has been provided for this image
Checking the size of the clusters¶
In [141]:
KM5_clust_sizes = KM5_clustered_only_quant.groupby('Cluster').size().to_frame()
KM5_clust_sizes.columns = ["KM_size"]
KM5_clust_sizes
Out[141]:
KM_size
Cluster
0 4373
1 5509
2 6849
3 9293
4 8620

4.3. DBSCAN¶

In DBSCAN there are two major hyperparameters:

  • eps
  • min_samples

It is difficult arbitrarily to say what values will work the best. Therefore, I will first create a matrix of combinations.

In [148]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from itertools import product

4.3.1. Logs DF¶

After many tries, I realized that DBSCAN will not work with the Logs DF, since it always crashes the Jupyter Lab (regardless if it is in my own computer or in Google Colab or in a powerful machine in Google Cloud Platform Workbench).

This is probably due to the fact that this dataset is a huge one (it has more than 2,000,000 rows, and it is already much smaller), and also it has only categorical variables (even though they are already encoded).

4.3.1. BUs DF¶

Chosing optimal parameters¶
In [214]:
eps_values = np.arange(6,10.75,0.75) # eps values to be investigated
min_samples = np.arange(6,12) # min_samples values to be investigated

DBSCAN_params = list(product(eps_values, min_samples))
In [215]:
# Because DBSCAN creates clusters itself based on those two parameters let's check the number of generated clusters.

no_of_clusters = []
sil_score = []

for p in DBSCAN_params:
    DBS_clustering = DBSCAN(eps=p[0], min_samples=p[1]).fit(X_bu_quant)
    no_of_clusters.append(len(np.unique(DBS_clustering.labels_)))
    sil_score.append(silhouette_score(X_bu_quant, DBS_clustering.labels_))
In [193]:
tmp = pd.DataFrame.from_records(DBSCAN_params, columns =['Eps', 'Min_samples'])   
tmp['No_of_clusters'] = no_of_clusters

pivot_1 = pd.pivot_table(tmp, values='No_of_clusters', index='Min_samples', columns='Eps')

fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(pivot_1, annot=True,annot_kws={"size": 16}, cmap="YlGnBu", ax=ax)
ax.set_title('Number of clusters')
plt.show()
No description has been provided for this image

The heatplot immediately above shows that, with the given parameters, the number of clusters vary from 2 to 330. However, most of the combinations gives more than 20 clusters. Nevertheless, we can safely choose numbers located on the bottom-left or the bottom-right corner of the heatmap.

Cluster # 1¶
Building the model and getting the clusters¶
In [234]:
DBS_clustering = DBSCAN(eps=16, min_samples=14).fit(X_bu_quant)

DBSCAN_clustered = X_bu_quant.copy()
DBSCAN_clustered.loc[:,'Cluster'] = DBS_clustering.labels_ # append labels to points
Checking the size of the clusters¶
In [235]:
DBSCAN_clust_sizes = DBSCAN_clustered.groupby('Cluster').size().to_frame()
DBSCAN_clust_sizes.columns = ["DBSCAN_size"]
DBSCAN_clust_sizes
Out[235]:
DBSCAN_size
Cluster
-1 6579
0 27851
1 30
2 7
3 13
4 64
5 78
6 12
7 10

DBSCAN created 8 clusters plus outliers cluster (-1). Sizes of clusters vary significantly. There are 6579 outliers.

Cluster # 2¶
Building the model and getting the clusters¶
In [262]:
DBS_clustering = DBSCAN(eps=6, min_samples=10).fit(X_bu_quant)

DBSCAN_clustered = X_bu_quant.copy()
DBSCAN_clustered.loc[:,'Cluster'] = DBS_clustering.labels_ # append labels to points
Checking the size of the clusters¶
In [263]:
DBSCAN_clust_sizes = DBSCAN_clustered.groupby('Cluster').size().to_frame()
DBSCAN_clust_sizes.columns = ["DBSCAN_size"]
DBSCAN_clust_sizes
Out[263]:
DBSCAN_size
Cluster
-1 34613
0 11
1 10
2 10

DBSCAN created 3 clusters plus outliers cluster (-1). Sizes of clusters are almost the same. There are 34613 outliers in this cluster.

4.3. SOM¶

Self Organizing Map (SOM) is an unsupervised ANN that uses competitive learning to update its weights - i.e Competition, Cooperation and Adaptation.

Each neuron of the output layer is present with a vector with dimension n. The distance between each neuron present at the output layer and the input data is computed. The neuron with the lowest distance is termed as the most suitable fit.

Updating the vector of the suitable neuron in the final process is known as adaptation, along with its neighbour in cooperation. After selecting the suitable neuron and its neighbours, we process the neuron to update. The more the distance between the neuron and the input, the more the data grows.

In [216]:
from minisom import MiniSom 

4.3.1. BUs DF¶

In [309]:
X_bu_quant.values.shape # Let's just check the number of columns in the dataset
Out[309]:
(34644, 7)
Setup # 1¶
Building the model¶
In [307]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.1, sigma=1.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
 [ 10000 / 10000 ] 100% - 0:00:00 left 
 quantization error: 35.115005121036326
Visualizing the U-Matrix¶

The U-Matrix is a common way to visualize the results of a Self-Organizing Map (SOM). It is a 2D representation of the SOM's neurons and the distances between them, where each cell in the U-Matrix corresponds to a neuron in the SOM. The color of each cell represents the distance between that neuron and its neighbors.

Each cell in the U-Matrix corresponds to a neuron in the SOM, and the numbers inside the cells are the indexes of the neurons. The lines separating the cells represent the distances between the neurons.

In general, cells with similar colors in the U-Matrix tend to have similar input vectors assigned to them. This means that these neurons form clusters of similar data in the input space. The darker the color, the more similar are the vectors assigned to that neuron and its neighbors, indicating that these neurons are closer in the input space.

In [310]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 2¶
Building the model¶
In [315]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.2, sigma=12)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
 [ 10000 / 10000 ] 100% - 0:00:00 left 
 quantization error: 54.98352278887444
Visualizing the U-Matrix¶
In [316]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 3¶
Building the model¶
In [321]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.5, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
 [ 10000 / 10000 ] 100% - 0:00:00 left 
 quantization error: 30.96662533699219
Visualizing the U-Matrix¶
In [322]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 4¶
Building the model¶
In [325]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.75, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
 [ 10000 / 10000 ] 100% - 0:00:00 left 
 quantization error: 29.3336298497645
Visualizing the U-Matrix¶
In [326]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 5¶
Building the model¶
In [333]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
 [ 50000 / 50000 ] 100% - 0:00:00 left 
 quantization error: 24.236883025608197
Visualizing the U-Matrix¶
In [334]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 6¶
Building the model¶
In [337]:
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 150000, verbose=True)
 [ 150000 / 150000 ] 100% - 0:00:00 left 
 quantization error: 22.86847569466429
Visualizing the U-Matrix¶
In [338]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 6¶
Building the model¶
In [339]:
neurons_a = 25
neurons_b = 25
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 200000, verbose=True)
 [ 200000 / 200000 ] 100% - 0:00:00 left 
 quantization error: 18.63131306217731
Visualizing the U-Matrix¶
In [340]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 6¶
Building the model¶
In [341]:
neurons_a = 40
neurons_b = 40
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 500000, verbose=True)
 [ 500000 / 500000 ] 100% - 0:00:00 left 
 quantization error: 16.029377127043546
Visualizing the U-Matrix¶
In [342]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 7¶
Building the model¶
In [345]:
neurons_a = 35
neurons_b = 35
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 500000, verbose=True)
 [ 500000 / 500000 ] 100% - 0:00:00 left 
 quantization error: 15.207089114831074
Visualizing the U-Matrix¶
In [346]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Setup # 8¶
Building the model¶
In [357]:
neurons_a = 10
neurons_b = 10
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
 [ 50000 / 50000 ] 100% - 0:00:00 left 
 quantization error: 29.85964189856346
Visualizing the U-Matrix¶
In [358]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Generating Clusters¶

Of course, we can also generate clusters from SOMs. These clusters are based on the winner neurons (basically, each winner neuron represents a cluster).

In [359]:
# each winner neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
# with np.ravel_multi_index we convert the bidimensional
# coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))

# plotting the clusters using the first 2 dimentions of the data
for c in np.unique(cluster_index):
    plt.scatter(X_bu_quant.values[cluster_index == c, 0],
                X_bu_quant.values[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)

# plotting centroids
#for centroid in som.get_weights():
#    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
#                s=80, linewidths=35, color='k', label='centroid')
plt.legend()
plt.show()
No description has been provided for this image
Setup # 9¶
Building the model¶
In [360]:
neurons_a = 3
neurons_b = 5
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
 [ 50000 / 50000 ] 100% - 0:00:00 left 
 quantization error: 54.21844073175913
Visualizing the U-Matrix¶
In [361]:
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
No description has been provided for this image
Generating Clusters¶
In [362]:
# each winner neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
# with np.ravel_multi_index we convert the bidimensional
# coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))

# plotting the clusters using the first 2 dimentions of the data
for c in np.unique(cluster_index):
    plt.scatter(X_bu_quant.values[cluster_index == c, 0],
                X_bu_quant.values[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)

# plotting centroids
#for centroid in som.get_weights():
#    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
#                s=80, linewidths=35, color='k', label='centroid')
plt.legend()
plt.show()
No description has been provided for this image
Setup # 9¶
Building the model¶
In [370]:
neurons_a = 2
neurons_b = 3
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.3, sigma=.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
 [ 50000 / 50000 ] 100% - 0:00:00 left 
 quantization error: 69.0330162978782
Generating Clusters¶
In [376]:
# Defining a function to make it easier and faster
def generate_cluster_from_som(dim1:int, dim2:int)->None:
    # each winner neuron represents a cluster
    winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
    # with np.ravel_multi_index we convert the bidimensional
    # coordinates to a monodimensional index
    cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))

    # plotting the clusters using the first 2 dimentions of the data
    for c in np.unique(cluster_index):
        plt.scatter(X_bu_quant.values[cluster_index == c, dim1],
                    X_bu_quant.values[cluster_index == c, dim2], label='cluster='+str(c), alpha=.7)

    plt.xlabel(X_bu_quant.columns.to_list()[dim1])
    plt.ylabel(X_bu_quant.columns.to_list()[dim2])
    # plotting centroids
    #for centroid in som.get_weights():
    #    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
    #                s=80, linewidths=35, color='k', label='centroid')
    plt.legend()
    plt.show()
In [377]:
generate_cluster_from_som(0, 1)
No description has been provided for this image
In [378]:
generate_cluster_from_som(2, 0)
No description has been provided for this image
In [380]:
generate_cluster_from_som(2, 3)
No description has been provided for this image
In [381]:
generate_cluster_from_som(6, 3)
No description has been provided for this image
In [382]:
generate_cluster_from_som(6, 0)
No description has been provided for this image
In [383]:
generate_cluster_from_som(4, 6)
No description has been provided for this image
In [384]:
generate_cluster_from_som(2, 6)
No description has been provided for this image

5. Conclusions ^¶

5.1. Patterns Found in the Data¶

K-Means and SOM algorithms and models generated some interesting clusters and, although it is possible to explain clusters from different points of view (basing on different variables), there are some visualizations in which the clusters appear more and are much more readable.

Basically, K-means generated the following 5 clusters from the logs DF:

  • Red (0): EVMs with medium quantity of votes on 22 (Bolsonaro) and medium to high quantity of votes on 13 (Lula). This cluster also shows EVMs with high quantity of voters with biometrics and evms with a high attendance rate.

  • Blue (1): EVMs with low quantity of votes on 22 (Bolsonaro) and low quantity of votes on 13 (Lula). This cluster also shows EVMs with low quantity of voters with biometrics.

  • Green (2): EVMs with medium quantity of votes on 22 (Bolsonaro) and low quantity of votes on 13 (Lula).

  • Purple (3): EVMs with high quantity of votes on 22 (Bolsonaro) and medium quantity of votes on 13 (Lula).

  • Orange (4): EVMs with low quantity of votes on 22 (Bolsonaro) and high quantity of votes on 13 (Lula). This cluster also shows EVMs with medium quantity of voters with biometrics.

And SOM generated the following 6 clusters from the logs DF:

  • Dark Blue (0): EVMs with medium quantity of votes on 22 (Bolsonaro), medium to high quantity of votes on 13 (Lula), high attendance rate, and high quantity of voters with biometrics.

  • Green (1): EVMs with low quantity of votes on 22 (Bolsonaro), high quantity of votes on 13 (Lula), high attendance rate, and medium quantity of voters with biometrics.

  • Red (2): EVMs with low quantity of votes on 22 (Bolsonaro), medium quantity of votes on 13 (Lula), medium to high attendance rate, and medium quantity of voters with biometrics.

  • Purple (3): EVMs with low quantity of votes on 22 (Bolsonaro), low (the lowest) quantity of votes on 13 (Lula), medium to low attendance rate (the lowest), and low quantity of voters with biometrics.

  • Yellow (4): EVMs with high (the highest) quantity of votes on 22 (Bolsonaro), low quantity of votes on 13 (Lula), medium to high attendance rate, and medium quantity of voters with biometrics.

  • Light Blue (5): EVMs with low quantity of votes on 22 (Bolsonaro), low quantity of votes on 13 (Lula), medium to high attendance rate, and low quantity of voters with biometrics (the lowest).

5.2. Algorithms Comparison¶

Regarding the three clustering algorithms used in this notebook, we can say that K-Means is the easiest to use, whereas SOM is the fastest and also seems to be the most complete and most customizable one - coincidence or not (probably not), SOM is the only algorithm that is considered to be within the Deep Learning subset, since it is an Artificial Neural Network.

DBSCAN is very heavy and needs a lot of computing power and tweaking to be really effective. Moreover, DBSCAN seems to be more suitable for anomaly detection than for clustering itself although, unfortunately, I wasn't able to effectively detect anomalies using DBSCAN in this project due to the reasons I will still mention in these conclusions.

5.3. Problems and Challenges¶

The first major challenge in this project was the data ingestion phase, because it needed a really efficient web scraper (in which I invested a lot of time to develop it) and because of the huge volume of data: in total, there were used around 490,000 EVMs in the 2nd round of the Brazilian Federal Elections last year. I ended up downloading the zip files of around 110,000 EVMs, and I couldn't keep going because of the limitations of my computer and of my Google Cloud Platform account (yes, I had to use my GCP Vertex AI Workbench to this project, since my computer wasn't enough).

The second major challeng was, also, a problem and was related to the first one: the lack of computing power to properly process this huge amount of data. Even using cloud resources, I got many crashes and I had to decrease the size of the datasets (mainly the logs one) to keep working in this project. Trying to work with the whole data from all the logs and all the EVM's boulletins (Boletins de Urna - BUs) from 110,000 EVMs wasn't feasible at all.

The third and last problem was the type of the data of the logs. The fact that all the features of the logs were categorical didn't help in anything. In fact, it was the major reason why the K-Means clusters of the logs data didn't seem useful at all. I tried to solve this problem by label encoding it (since one-hot / dummy encoding wasn't feasible) and scaling it, but it didn't helped too much.

6. References ^¶

[1] Victor Dey, "Beginners Guide to Self-Organizing Maps", AIM, 2021
[2] Giuseppe Vettigli, "MiniSom", GitHUb, 2022
[3] Robert Kwiatkowski, "Customers clustering: K-Means, DBSCAN and AP", Kaggle, 2022
[4] Jorge Martín Lasaosa, "Clustering on numerical and categorical features", Towards Data Science, 2021